1.  Introduction 


Computer-assisted translation (CAT) tools provide human translators with a software system to
support and facilitate their translation activities. Typical functionality includes (1) Translation
Memory (TM), which contains previously translated segments that are automatically searched
and used to suggest possible translations; (2) spell-checkers; (3) glossaries; (4) dictionaries;
(5) alignment and segmentation tools allowing the translator to decide the segment boundaries
(typically sentence or paragraph); and most recently, (6) machine translation (MT) support.

Statistical  machine  translation  (SMT)  is  a  paradigm  for  MT  based  on  statistical  analyses  of 
parallel  bilingual  text  data.  The  U.S.  Army  Research  Laboratory  (ARL)  has  developed  a 
technique  for  SMT  model  adaptation  and  applied  it  to  a  variety  of  specialized  domains  (medical, 
legal,  military,  agriculture,  etc.)  to  assist  English  to  Dari  document  translation  in  support  of 
Operation Enduring Freedom (OEF) in Afghanistan. Traditional SMT systems typically
seek to be as general purpose as possible, whereas many of these domains have highly specialized and
peculiar jargon that is unlikely to be captured in a general-purpose model. Further
complicating the translation process is the dearth of translators with bilingual subject matter
expertise for the low-resource target languages of Army transition and training operations in
Afghanistan. In the course of site visits to operational units in theater, researchers at ARL
perceived that the current model of hiring a roomful of (expensive) translators to translate
individual documents is wasteful, in that it fails to methodically capture the expertise of these
translators.  While  translated  documents  are  the  end  product  of  language  operations,  capturing  the 
translators’  expertise  in  an  automated  fashion  can  serve  to  bootstrap  the  translation  of  future 
documents  in  the  same  domain.  Moreover,  it  can  promote  consistent  use  of  technical  terms  in  the 
target  languages  and  increase  the  overall  effectiveness  of  human  translators,  especially  as  they 
rotate  in  and  out  of  their  job,  as  happens  frequently  in  Afghanistan  and  throughout  the  U.S. 
Central Command (CENTCOM). ARL researchers developed a process we call “iterative post-editing
domain adaptation for SMT” to capture the translators’ expertise. Central to this process
is  both  the  careful  management  of  how  bilingual  subject  matter  experts  perform  their  translation 
work,  so  that  carefully  aligned  parallel  sentences  result,  and  a  very  short  software  development 
cycle,  which  presents  the  human  experts  with  draft  translations  incorporating  the  same  choices 
they  made  in  their  most  recent  translation  assignments.  As  SMT  models  are  recomputed  with 
each  new  chapter  or  “chunk”  of  text,  they  become  more  tightly  focused  on  the  domain  of 
interest. 

Once  these  highly  specialized,  domain-specific  SMT  models  have  been  created,  the  next  step  is 
to  determine  how  to  effectively  use  them  outside  the  laboratory  environment  in  which  they  were 
developed.  To  that  end,  we  created  a  Web  service  for  each  SMT  model,  or  SMT  engine,  to 
accommodate a variety of different front-ends, from browser-based thin clients to workstation
applications.  We  realized  that,  of  the  various  ways  that  government-owned  Web  translation 
services  might  be  offered  to  Army  users,  integration  with  existing  CAT  tools  might  be 
particularly  useful,  as  the  same  CAT  system  can  readily  provide  access  to  other  translator  tools, 
such  as  bilingual  glossaries  and  spell-checkers. 

This  report  describes  the  iterative  post-editing  domain  adaptation  algorithm,  as  well  as  a  method 
by  which  domain-specific  SMT  can  be  offered  as  Web  services  and  a  procedure  for  the 
integration  of  Web-based  translation  services  with  an  open-source  CAT  tool  called  OmegaT. 


2.  Iterative  Post-Editing  Domain  Adaptation  for  SMT 


The  current  model  in  the  Army  for  the  translation  of  low  resource  languages  (e.g.,  Dari)  for 
which  there  are  few  expert  translators  is  to  form  teams  of  bilingual  contractors  who  work 
together  under  a  supervising  staff.  With  today’s  emphasis  on  military  training  and  security 
assistance  operations,  such  teams  concentrate  on  translating  specific  types  of  documents  with  the 
aim  of  generating  high-quality  translations.  The  tools  these  teams  typically  have  available  are 
word  processors  and  access  to  some  existing  translation  resources  on  the  Web.  Some  well-led 
teams  share  glossaries  of  parallel  aligned  terms  (usually  in  the  form  of  Excel  spreadsheets),  but 
other  resources  to  help  automate  the  translation  process  are  typically  not  available.  To  further 
complicate  the  translation  work,  many  of  the  documents  identified  are  in  a  highly  specialized 
domain  (medicine,  legal,  agricultural,  etc.)  with  terms  that  may  not  be  familiar  to  the  translators. 
ARL  researchers  noted  that  this  current  model  focuses  almost  exclusively  on  the  translated 
document  itself  as  the  end  product,  to  the  detriment  of  translation  as  a  process.  As  a  result,  the 
expert  knowledge  that  is  generated  during  the  course  of  the  translation  is  preserved  only  in  the 
mind  of  the  human  translators.  When  human  translators  finish  their  assignments,  valuable 
knowledge  leaves  with  them. 

There  are  a  number  of  artifacts  generated  during  the  translation  effort,  which  with  the  help  of 
technology  can  be  preserved  to  further  assist  future  translations  within  the  highly  specialized 
domain.  One  method  for  preserving  expert  translation  knowledge  is  to  use  automatic  term 
extraction  to  identify  domain-specific  language  in  a  translated  document.  A  variant  of  tf-idf 
(Luhn,  1958;  Yu  and  Salton,  1976;  Robertson  and  Sparck  Jones,  1976)  can  be  used  to  identify 
the  most  important  phrases  and  terms  to  generate  a  glossary  for  the  domain.  Another  method  is 
to  create  parallel  text  from  the  archives  of  a  team’s  translation  work  and  align  that  text  at  the 
sentence  level;  such  a  corpus  can  then  be  used  to  produce  SMT  models  adapted  to  the  domain  in 
which the team is specialized. In turn, this SMT model can be used to generate initial draft translations,
or translation hypotheses, for successive documents. When enough parallel data is available for
a mature SMT model, the translator just edits the translation hypotheses instead of starting a new
translation completely from scratch.
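
As an illustration of the term-extraction idea, the following Python sketch ranks the words of one translated document by a plain tf-idf weight computed against a small background collection. The naive tokenizer, the unsmoothed idf, and the function name top_terms are assumptions made for this example; they are not the tf-idf variant used in the work described here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs only (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(domain_doc, background_docs, k=5):
    """Return the k terms of domain_doc with the highest tf-idf score.

    Hypothetical helper: document frequency is counted over the background
    documents plus the domain document itself, so it is never zero.
    """
    tf = Counter(tokenize(domain_doc))
    docs = [set(tokenize(d)) for d in background_docs] + [set(tf)]
    n_docs = len(docs)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in docs if term in d)
        scores[term] = count * math.log(n_docs / df)
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Terms that occur in every document receive a score of zero, so common function words fall to the bottom of the ranking while domain-specific vocabulary rises to the top. A real glossary builder would also need to handle multiword phrases and pair each source term with its translation.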

Our approach to capturing this expert knowledge is to think of translation work in terms of
projects, each of which deserves its own SMT model. We first start with a baseline SMT model
trained  on  the  best  parallel  data  available  and  use  that  to  generate  hypotheses  for  a  portion  (we 
say  “chunk”)  of  the  entire  translation  project.  For  the  baseline  SMT  model  used  in  our  legal  Dari 
SMT  project,  ARL  had  collected  some  46,000  parallel  English/Dari  sentences,  only  some  of 
which  pertained  to  the  legal  domain.  These  were  used  to  produce  a  baseline  SMT  model  with 
which  the  first  chunk  of  the  translation  project  was  translated,  i.e.,  rendered  as  a  set  of 
hypotheses.  According  to  our  approach,  these  translation  hypotheses  are  then  given  to  a  human 
translator  to  correct,  or  post-edit.  When  the  human  is  finished  with  that  portion  of  the  document, 
the  results  are  used  to  retrain  the  SMT  model  and  retune  it  for  better  accuracy  on  the  domain  of 
interest.  This  resulting  new  SMT  model  is  then  used  to  generate  hypotheses  for  the  next  portion 
of  the  document.  This  process  continues  until  the  document  comprising  the  project  has  been 
completely  translated  with  the  end  result  being  an  SMT  model  highly  tuned  to  the  specific 
domain  (as  well  as  to  the  specific  human  translator  and  the  specific  document).  While  the 
process  has  been  shown  to  produce  high  quality  translations  with  excellent  consistency  of 
terminology,  it  raises  the  concern  of  overfitting  the  SMT  model  to  a  very  narrow  purpose  and  to 
the  preferences  of  a  single  translator. 

Happily,  the  continuous  nature  of  the  SMT  development  approach  means  the  final  SMT  model 
for  one  project  is  only  the  initial,  or  baseline,  SMT  model  for  the  next  one.  As  more  and  more 
documents  are  translated  by  different  human  translators,  the  effect  of  any  single  document  and 
any  single  translator  diminishes  and  the  SMT  model  effectively  becomes  more  representative  of 
the  team  operating  in  their  habitual  domain. 

A  formal  statement  of  the  algorithm  is  shown  in  listing  1  and  the  data  flow  diagram  is  shown  in 
figure  1. 
Definitions

Let a segment be a sequence of terms. Let $s$ be a segment in a source language and $t$ a segment in a target language.
Let an ordered pair $(s, t)$ represent a translation pair.

Then a bilingual parallel corpus is $C := \{(s_i, t_i) : 0 < i \le |C|,\ s_i \ne \emptyset,\ t_i \ne \emptyset\}$, where $t_i$ is assumed to be an expert translation of
segment $s_i$ from the source to the target language.

Let $\mathrm{TRAIN}_i := \{(s_j, t_j) : 0 < j \le |\mathrm{TRAIN}_i|,\ s_j \ne \emptyset,\ t_j \ne \emptyset\}$ and $\mathrm{TUNE}_i := \{(s_j, t_j) : 0 < j \le |\mathrm{TUNE}_i|,\ s_j \ne \emptyset,\ t_j \ne \emptyset\}$, where $t_j$ is
assumed to be an expert translation of segment $s_j$ from the source to the target language, be the $i$th-iteration training and
tuning sets of translation pairs, respectively.

Let $\mathrm{TERMS}_i := \{term : \exists\,(s_j, t_j) \in \mathrm{TRAIN}_i,\ term \in s_j\}$ be the set of all source language terms in the $i$th-iteration
training set.

Let $\mathrm{TM}_i(\mathrm{TRAIN}_i, \mathrm{TUNE}_i)$ := the translation model resulting from the $i$th-iteration training and tuning sets.

Let $D$ := the document to be translated and $D_i \subseteq D$ := the $i$th set of source segments to be translated.

Let the translation hypotheses be $H_i := \{(s_j, t'_j) : s_j \in D_i\}$, where $t'_j$ represents a proposed translation of segment $s_j$
from the source to the target language. Note that if any term in $s_j$ is out of vocabulary (i.e., it does not appear anywhere in
$\mathrm{TERMS}_i$), $t'_j$ will be a mix of the source and target languages, and if all of the terms in $s_j$ are out of vocabulary, $t'_j = s_j$.

Let a translation $T_i := \{(s_j, t_j) : s_j \ne \emptyset,\ t_j \ne \emptyset,\ \exists\, t'_j : (s_j, t'_j) \in H_i\}$ be the post-edited, human-generated
translation of $H_i$.

Finally, let $\mathrm{OOV}_i := \{(s_j, t_j) \in T_i : \exists\, term \in s_j,\ term \notin \mathrm{TERMS}_i\}$ be the set of all translation pairs
in $T_i$ whose source segment $s_j$ contains at least one term that does not appear in $\mathrm{TERMS}_i$.

Initial Conditions

$\mathrm{TRAIN}_0$ and $\mathrm{TUNE}_0$ are nonempty disjoint proper subsets of $C$, the union of which is $C$: $\mathrm{TRAIN}_0 \subset C$,
$\mathrm{TUNE}_0 \subset C$, $\mathrm{TRAIN}_0 \cup \mathrm{TUNE}_0 = C$, $\mathrm{TRAIN}_0 \cap \mathrm{TUNE}_0 = \emptyset$, $\mathrm{TRAIN}_0 \ne \emptyset$, $\mathrm{TUNE}_0 \ne \emptyset$, and
the size of $\mathrm{TUNE}_0$ is much less than the size of $\mathrm{TRAIN}_0$: $|\mathrm{TUNE}_0| \ll |\mathrm{TRAIN}_0|$.

Algorithm

$i = 0$

while $D \ne \emptyset$

    // Train and tune the ith translation model
    $\mathrm{TM}_i(\mathrm{TRAIN}_i, \mathrm{TUNE}_i)$

    // Select the ith subset of segments left in D to translate
    Select $D_i$

    // Update D
    $D = D - D_i$

    // Generate the ith translation hypotheses by decoding D_i
    $H_i = \mathrm{Decode}(\mathrm{TM}_i, D_i)$

    // Give the hypotheses to the translator
    $T_i = \mathrm{Postedit}(H_i)$

    // Update the (i+1)th training set with all the out-of-vocabulary translation pairs
    $\mathrm{TRAIN}_{i+1} = \mathrm{TRAIN}_i \cup \mathrm{OOV}_i$

    // Update T_i
    $T_i = T_i - \mathrm{OOV}_i$

    // Split the remainder of T_i between the (i+1)th training and tuning sets
    $\{\mathrm{NewTrain}, \mathrm{NewTune}\} = \mathrm{Split}(T_i)$

    // Update the (i+1)th training and tuning sets
    $\mathrm{TRAIN}_{i+1} = \mathrm{TRAIN}_{i+1} \cup \mathrm{NewTrain}$
    $\mathrm{TUNE}_{i+1} = \mathrm{TUNE}_i \cup \mathrm{NewTune}$
    $i = i + 1$

done

Listing 1. Iterative post-editing domain adaptation for SMT algorithm.
Figure 1. Iterative post-editing domain adaptation data flow.
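
The loop in listing 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used for the report: terms are taken to be whitespace-separated tokens, the split of the non-OOV pairs is a seeded random shuffle, each chunk is a fixed number of consecutive segments, and train_model, decode, and postedit are caller-supplied placeholders standing in for SMT training and tuning, decoding, and human post-editing. All function names are hypothetical.

```python
import random


def source_terms(pairs):
    """TERMS_i: every source-language term seen in the training pairs."""
    return {term for src, _ in pairs for term in src.split()}


def partition_postedits(postedited, train, train_fraction=0.5, rng=None):
    """Return (new_train, new_tune) for one chunk of post-edited pairs.

    Pairs whose source segment contains an out-of-vocabulary term always go
    to training; the rest are split by train_fraction (assumed random split).
    """
    rng = rng or random.Random(0)
    known = source_terms(train)
    oov, rest = [], []
    for pair in postedited:
        has_oov = any(term not in known for term in pair[0].split())
        (oov if has_oov else rest).append(pair)
    rng.shuffle(rest)
    cut = int(len(rest) * train_fraction)
    return oov + rest[:cut], rest[cut:]


def adapt(document, train, tune, chunk_size, train_model, decode, postedit):
    """Run the loop until the document is exhausted; return the final sets."""
    remaining = list(document)
    train, tune = list(train), list(tune)  # leave the caller's lists untouched
    while remaining:
        model = train_model(train, tune)             # TM_i(TRAIN_i, TUNE_i)
        chunk, remaining = remaining[:chunk_size], remaining[chunk_size:]
        hypotheses = decode(model, chunk)            # H_i
        postedited = postedit(hypotheses)            # T_i
        new_train, new_tune = partition_postedits(postedited, train)
        train += new_train                           # TRAIN_{i+1}
        tune += new_tune                             # TUNE_{i+1}
    return train, tune
```

Because the vocabulary is recomputed from the current training set on every pass, a term counts as out of vocabulary only until the first post-edited sentence containing it has been added to the training data.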


3.  Related  Work 


In the field of CAT, most modern CAT tools already provide some level of access to MT, and
some work has been done to quantify its impact on interactive translation (Koehn, 2009; Plitt and
Masselot, 2010), but these systems typically do not provide MT support for low-resource
languages like Dari. Incorporating MT for new language pairs or tailoring an existing model to a
specific domain in proprietary CAT tools, as with SMT domain adaptation, also incurs additional
costs since the MT vendor is usually involved. Government-owned SMT models made with
open-source SMT systems like Joshua and open-source CAT tools like OmegaT avoid these
costs while providing versatility for translation work specific to domains of military interest.

Other researchers have looked at applying SMT as a “statistical post-editing” step to improve the
quality of output from rule-based MT systems like SYSTRAN. Their focus is on combining the
two MT paradigms to boost overall accuracy, while ours instead is on repeatedly using
human-in-the-loop translators to further refine and adapt SMT models within the context of a single
paradigm (Dugast, Senellart, and Koehn, 2007; Simard et al., 2007; Terumasa, 2007; Lagarda et
al., 2009).

Some related work has also been done on using active learning for SMT. Most of the work
focuses on optimizing sentence selection to give to the human translators (Callison-Burch, 2003;
Mandal et al., 2008; Haffari and Sarkar, 2009; Ambati and Vogel, 2010; Gonzalez-Rubio,
Ortiz-Martinez, and Casacuberta, 2011; Ananthakrishnan et al., 2011; Bakhshaei and Khadivi, 2012).

In our work, sentence selection is driven by sentence order in the document. To generate the highest
quality translation, the translators are given the sentences in context. Other related work on SMT
domain adaptation includes the use of mixture models, classifiers, and metrics to distinguish a
particular domain and/or incorporate multiple domain models in a single system (Bertoldi and
Federico, 2009; Foster, Goutte, and Kuhn, 2010; Wang et al., 2012; Sennrich, 2012). In our
work, the focus is on improving the quality of the translation and the experience of the
translators where the domain and SMT model are known beforehand. While some commercial
SMT systems like SDL Language Weaver (http://www.sdl.com/products/sdl-enterprise-language-server/)
have provisions to generate SMT models from user data for highly specialized
domains, similar to our iterative post-editing domain adaptation algorithm, such customized
training incurs significant software licensing costs and annual software maintenance costs.

Morgan (2010) first used this iterative post-editing domain adaptation algorithm to assist in the
translation of the medical training manual Fundamental Critical Care Support, published by the
Society of Critical Care Medicine, into Dari. He used it again in 2011 to translate U.S. Army
Field Manual 7-8 (The Rifle Platoon and Squad) from English into Pashto (Morgan, 2011). In his
reports, he notes increased translation accuracy, measured as rising Bilingual
Evaluation Understudy (BLEU) scores, as each chapter or chunk of the manual is automatically
rendered as a draft translation, corrected (post-edited) by a human expert, and then used to
retrain and retune the SMT model before it is used to produce a draft translation of the next
chapter or chunk. The work here builds on Morgan’s work but is more selective in which of the
post-edited translations go into the training set and which go into the tuning set. Instead of
retraining and retuning on the entire chunk of post-edited text, any source language sentences
that contain out-of-vocabulary (OOV) terms go into the training set along with their translations,
so the SMT model has at least one translation for each OOV item. The remaining parallel text is
split between the training and tuning sets by a user-defined parameter (we used 50% in training
and 50% in tuning).


4.  Web  Service  Front-End  to  the  Joshua  SMT  System 


Once we have a suite of refined, highly tuned, domain-specific SMT models, we want to expose
them for use by other applications. The SMT decoder we used for this effort is the open-source
Joshua decoder developed by the Center for Language and Speech Processing and the Human
Language Technology Center of Excellence at the Johns Hopkins University
(http://joshua-decoder.org/). Proprietary SMT decoders currently in use by the Army come with a high cost.
Some  systems  have  license  costs  greater  than  $100,000  with  as  much  as  a  15%  annual 
maintenance  fee.  The  Joshua  decoder  is  free  and  provides  a  mechanism  for  individual 
researchers  and  users  to  retrain  and  retune  the  SMT  on  different  domains.  In  a  research 
environment,  the  standard  stable  release  of  the  Joshua  decoder  is  typically  invoked  via  the 
command  line.  A  user  sends  a  list  of  source  language  sentences  and  receives  translation 
hypotheses  back.  An  issue  with  this  command-line  interface  is  that  the  entire  SMT  engine  is 
reinitialized  for  every  list  of  sentences,  which  is  inefficient  and  results  in  long  latencies.  The 
development  branch  of  the  Joshua  decoder,  however,  contains  a  client-server  interface  where  the 
client  application  sends  the  sentence(s)  over  a  TCP/IP  socket  and  the  server  sends  the  translation 
hypotheses  back  without  having  to  reinitialize  the  system  each  time.  While  this  efficiently  and 
effectively  exposes  the  SMT  models,  it  requires  the  client  application  to  reside  on  the  same 
network  where  the  server  lives.  To  expose  the  SMT  models  to  a  wider  community,  we  wrapped 
the  Joshua  decoder  server  with  a  simple  Representational  State  Transfer  (REST)  Web  service. 
For each domain, we create a separate instance of the Joshua decoder, each with its own SMT
model and its own Web service URL, as shown in figure 2.


SMT model           Joshua decoder address   Web service URL (English in, Dari out)
Dari Legal          172.18.30.10:10100       http://www.translate.mil/DariLegal
Dari Medical        172.18.30.10:10101       http://www.translate.mil/DariMed
Dari Agriculture    172.18.30.10:10102       http://www.translate.mil/DariAg

Figure 2. Deployment of various domain-specific SMT models.
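
One way to picture the deployment in figure 2 is as a lookup from the URL path of an incoming request to the decoder instance that serves it. The Python sketch below is only an illustration: the report does not say how the Web service selects a decoder, the pairing of ports to models is assumed from the order in which they appear in the figure, and BACKENDS and resolve_backend are hypothetical names.

```python
from urllib.parse import urlparse

# Path-to-decoder table transcribed from figure 2. Which port belongs to
# which model is an assumption based on the left-to-right order in the figure.
BACKENDS = {
    "/DariLegal": ("172.18.30.10", 10100),
    "/DariMed": ("172.18.30.10", 10101),
    "/DariAg": ("172.18.30.10", 10102),
}


def resolve_backend(url):
    """Return the (host, port) of the decoder for a service URL, or None."""
    path = urlparse(url).path.rstrip("/")
    return BACKENDS.get(path)
```

Keeping the table separate from the request-handling code means that adding a new domain only requires starting another decoder instance and adding one entry.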
The Web service itself is extremely lightweight and simple. Clients HTTP POST one or more
English  sentences  to  the  URL  and  the  HTTP  Response  contains  the  translations.  A  typical 
translation  transaction  is  shown  in  listings  2  and  3. 

POST  http://www.translate.mil/DariLegal  HTTP/1.1 

Fingerprints,  although  they  may  be  found  50  years  after  being  deposited  on  a  piece  of 
paper,  are  at  the  same  time  very  fragile  and  easily  destroyed. 


Listing  2.  Example  English  HTTP  POST. 


HTTP/1.1  200  OK 


[Dari translation of the posted sentence]


Listing  3.  Example  Dari  response. 
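
A client for the transaction in listings 2 and 3 could look like the following Python sketch, which uses only the standard library. The wire format is an assumption: the report shows a plain-text body, so the sketch sends one UTF-8 sentence per line and expects one translation per line in return. The Content-Type header and the function names build_request and translate are likewise choices made for this example.

```python
import urllib.request


def build_request(url, sentences):
    """Build an HTTP POST whose body holds one sentence per line (assumed format)."""
    body = "\n".join(sentences).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )


def translate(url, sentences, timeout=30):
    """POST the sentences and return the response body split into lines."""
    request = build_request(url, sentences)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").splitlines()
```

With a reachable service, a call such as translate("http://www.translate.mil/DariLegal", ["Fingerprints are fragile."]) would return the Dari hypotheses as a list of strings.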

When a client makes a connection and submits the English sentences, the Web service first stores
the sentences in a temporary file. Since the Joshua decoder works best when the sentences are
tokenized and lowercased, a pipeline of Perl scripts is invoked. First, a script tokenizes the
sentences; the result is passed to a second script, which converts it to lowercase. The final result
is then sent to the Joshua decoder server via the Netcat utility, and the Web service
captures the translation from stdout. The data flow is shown in figure 3 and the shell script in listing 4.
Figure 3. English-to-Dari translation dataflow.


cat EnglishTemp.txt | perl tokenizer.pl | perl lowercase.pl | nc localhost 10101


Listing  4.  English-to-Dari  translation  shell  script. 
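
For readers who want to call the decoder server without the shell pipeline, the same three stages in listing 4 can be approximated in Python. This is a sketch under assumptions: the regular-expression tokenizer is a crude stand-in for tokenizer.pl, the server is assumed to read newline-terminated sentences and to reply once the client stops sending, and the function names are hypothetical.

```python
import re
import socket


def tokenize(sentence):
    """Separate punctuation from words (a crude stand-in for tokenizer.pl)."""
    return " ".join(re.findall(r"\w+|[^\w\s]", sentence))


def preprocess(sentence):
    """Tokenize, then lowercase, mirroring the order of the Perl pipeline."""
    return tokenize(sentence).lower()


def decode(sentences, host="localhost", port=10101):
    """Send preprocessed sentences to the decoder socket and collect the reply."""
    payload = "".join(preprocess(s) + "\n" for s in sentences).encode("utf-8")
    with socket.create_connection((host, port)) as conn:
        conn.sendall(payload)
        conn.shutdown(socket.SHUT_WR)  # tell the server no more input follows
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8").splitlines()
```

Doing the preprocessing in-process avoids the temporary file and the three extra processes per request, at the cost of having to keep the tokenization consistent with whatever was used to prepare the training data.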


5.  Integration  with  OmegaT 


OmegaT is a free CAT tool intended for professional translators. Its list of features includes
creation  and  management  of  TM;  TM  search  capabilities;  fuzzy  matching  against  TMs  to 
propose  translations;  spell-checking,  glossary,  and  dictionary  look-up;  support  for  multiple  file 
formats;  regular  expressions;  programmable  segmentation;  and  auto  replacement  of  suggested 
translations  (http://www.omegat.org/en/omegat.html). 

OmegaT  was  selected  as  the  CAT  tool  to  integrate  with  because  it’s  open  source  and  extensible. 
OmegaT  is  written  in  the  Java  language  with  extensibility  provided  via  interfaces  and  abstract 
base  classes.  Prepackaged  MT  plug-ins  are  provided  for  translations  from  Apertium 
(http://www.apertium.org),  Belazar,  and  Google  Translate  (http://translate.google.com/).  To 
facilitate  the  incorporation  of  additional  MT  systems,  OmegaT  defines  an  interface 
IMachineTranslation, which specifies two functions for getting the name of the translation
plug-in (getName()) and the translation itself (getTranslation(...)), and an abstract base class
BaseTranslate, which implements IMachineTranslation, hooks the translation system to the user
interface, and specifies two abstract methods, getPreferenceName() and translate(...). Adding a
new MT system to OmegaT merely requires extending the BaseTranslate class, overriding the
getName(), getPreferenceName(), and translate(...) methods, and adding a new property to the
Bundles.properties file to indicate in which menu the item should appear and how it should appear
(MT_ENGINE_JOSHUA=Dari Legal). The Unified Modeling Language (UML) class diagram
is shown in figure 4.


Figure  4.  JoshuaDariLegalTranslate  class  diagram. 

To  use  the  system,  the  translator  first  enables  the  translation  engine  by  selecting  it  from  the 
Options  menu  (multiple  translation  engines  can  be  active  simultaneously,  see  figure  5).  As  the 
user  processes  each  segment  in  the  Editor  panel  on  the  left,  the  proposed  Dari  translation  is 
shown  in  the  Machine  Translation  panel  on  the  right  (figure  6). 
Figure 5. Enabling the Joshua Dari Legal SMT model from OmegaT.
Figure 6. Using the Dari Legal SMT model in OmegaT.


6.  Conclusion 


In this report, we have described a human-in-the-loop iterative post-editing domain adaptation
algorithm for generating SMT models for highly specialized domains. Additionally, we
described  integrating  these  SMT  models  with  the  open-source  Joshua  decoder  and  exposing 
these  models  as  Web  services,  making  them  available  to  any  user  or  system  with  Internet  access. 
Finally,  we  demonstrated  how  one  such  system  might  use  these  services  by  integrating  them  with 
the  open-source  CAT  tool  OmegaT.  We  are  in  the  process  of  translating  the  National  Institute  of 
Justice’s  Fingerprint  Sourcebook  from  English  into  Dari  and  will  report  on  improvements  in 
BLEU  score  once  complete. 
List of Symbols, Abbreviations, and Acronyms

ARL         U.S. Army Research Laboratory
BLEU        Bilingual Evaluation Understudy
CAT         computer-assisted translation
CENTCOM     U.S. Central Command
MT          machine translation
OEF         Operation Enduring Freedom
OOV         out of vocabulary
REST        Representational State Transfer
SMT         statistical machine translation
TM          Translation Memory
UML         Unified Modeling Language